Setup
Link to scenario: https://www.katacoda.com/orm-mfrigaard/scenarios/03-effective-storytelling
The index.json configuration file is set using the following environment:
"environment": {
"uilayout": "editor-terminal-split"
},
"backend": {
"imageid": "rlang:3.4"
}Outline
Below are the 20 steps (plus intro and finish) files in the scenario.
Objectives
The objectives for this scenario are:
recognize the needs of your audience (data literacy, level of expertise, etc.)
check and communicate data quality with stakeholders
identify the correct data visualization (single variable, bivariate, and multivariate graphs) based on the data
incorporate feedback from stakeholders/audience into graphs
design visualizations with the appropriate detail and annotations that inform (and do not mislead) the audience
The learners
The learners I’m expecting to be participating in this course are:
Jessica will take this class on her own time for professional development.
Peter will this course in a two-day weekend because he needs to complete a project.
Bruce is using these scenarios to supplement a semester-long undergraduate course on R.
Jane has been told to take this course for his job because his team is using R.
intro
- included in intro.md?
Welcome!
Welcome to ‘Combine narrative with numbers for effective storytelling in R’! In this scenario, we will cover how to build data visualizations that effectively communicate and engage your audience.
Now that we have some experience with data wrangling with tidyr and dplyr, and data visualization with ggplot2, we can put these tools together to get your point across to stakeholders and audiences.
What makes a bad graphic?
Bad graphs aren’t just ugly; they’re misleading. A chart can have pretty colors and novel font styling, but that can’t make up for inaccurately presenting data. Mislabeling axes, using inappropriate scales or labels, and including unnecessary elements (‘chart junk’) are common characteristics of bad graphics.
We’ll be using the terms ‘graph’, ‘figure,’ and ‘chart’ interchangeably throughout this scenario.
Claus Wilke describes the difference between ugly, bad, and wrong graphs in his excellent text, Fundamental of Data Visualization.
- An ugly graph isn’t aesthetic appealing, but the data presented is clear and informative
- A bad graph fails to communicate the information it contains because of poor design
- A graph is wrong if it’s mathematically incorrect (the underlying calculations, representations, or transformations are inaccurate).
We want to avoid making ugly, bad, and wrong graphs.
What makes a great graphic?
Have you ever heard a joke and thought, “it’s funny because it’s true”?
We all know a great data visualization when we see it, but can you explain why it’s so great? Beyond just being aesthetically appealing (the choice of color palettes, fonts, etc.), data visualizations are tools for communicating complicated information.
Well, great graphics are similar to great jokes in this way–they should reveal a complicated ‘truth’ that was otherwise difficult to comprehend or articulate with words alone.
This scenario will show you how to make sure your graphs and figures communicate complexity effectively without misleading your audience.
step 1
- included in step1.md?
- grammar?
- spelling?
Things to consider about your audience
You’ll need to determine who the audience or stakeholders will be before creating graphs or figures. You’ll likely create multiple charts throughout a data analysis that you won’t include as a final deliverable. But in most cases, our stakeholders or audience is whoever is getting the final results or product.
The final graphs you produce will depend on 1) the question we’re trying to answer, and 2) the level of statistical literacy of our audience.
If your audience isn’t familiar with particular data visualizations, provide them with enough information to interpret the graph (and check their understanding). However, if you find yourself spending more time explaining a data visualization’s design than what the visualization reveals, we should consider a different graph.
Asking questions
Data visualization should be an iterative process, and getting regular feedback from your audience will help you understand their point of view. It will also help manage their expectations regarding the final deliverable.
As Hilary Mason and D.J. Patil point out in their 2015 text, Data Driven: Creating a Data Culture, asking the right questions “involves domain knowledge and expertise, coupled with a keen ability to see the problem, see the available data, and match up the two.”
The questions below can help guide your project and make sure you understand what your audience/stakeholders are expecting:
- What question(s) is this project trying to answer? (or What problem(s) is this project trying to solve?)
- Do we have access to the data to answer the question/problem posed? (or do we need to gather more data?)
- What is the current format/structure/location of the data? (this will have an enormous impact on the project timeline!)
- What context will we present the deliverable(s) in? (slide deck, website, report, etc.)
- How familiar will the audience be with the data in our project? (how much background information should we be providing?)
We recommended having the answers to these questions documented somewhere, as this will 1) keep your project focused and timely, 2) ensure both you and your client/customer have a clear vision for successful completion.
You want to fully understand what you are visualizing, who the audience will be, and why they will care about the results.
step 2
- included in step2.md?
- grammar?
- spelling?
Data lineage: the background on our data
It’s best to start a project off with a ‘view of the forest from outside the trees’. The technical term for this is data lineage, which
“includes the data origin, what happens to it, and where it moves over time.”
Having a “birds” eye view’ of the data ensures there weren’t any problems with exporting or importing. Data lineage also means understanding where the data are coming from–was it collected from an internal relational database, an external vendor, or did it come from the web or social media?
Knowing some of the technical details behind a dataset lets us frame the questions or problems we’re trying to tackle. In this scenario, we will use a variety of data sources, but they will all be [tabular](https://en.wikipedia.org/wiki/Table_(information). Tabular data, like spreadsheets, organize data into columns and rows. R can handle multiple kinds of data, but that is a topic that extends beyond the scope of this scenario.
Initiate R
Let’s load some data and get started! Launch an R console by clicking here -> R (Click on the Run command icon)
Load packages
The package we’ll use to view the entire datasets with R is skimr. We will install and load these packages below:
install.packages(c("tidyverse", "skimr"))
library(tidyverse)
library(skimr)step 3
- included in step3.md?
- grammar?
- spelling?
Before you start: what do we expect to see?
Before starting a new project, we want to set some expectations. The questions we covered in the previous step help us understand what kind of data we’ll be encountering. Sometimes we’ll be dealing with unknown data, but we should know approximately how many columns and rows the new dataset will contain. We might know some basic information about the variable formats, too.
For example, we should see if we’re getting date columns (YYYY-MM-DD), logical (TRUE, FALSE, NA), numerical measurements (integer (1L) or double (1)), or categorical data (character (male and female) or factor (’low,medium,high`)).
Baseball data
We’re going to load a dataset to demonstrate a few ways to investigate a dataset’s quality (or how well it matches our expectations).
These data come from Sean Lahman’s Baseball Database.
Now, I am not going to assume everyone participating in this scenario is familiar with baseball. However, this exercise is arguably more rewarding if you are not a baseball fan. If you’re working with data, part of your job to be interested in whatever you’ve been asked to analyze (even if it is only for the monetary reward).
“…if you want to work in data visualisation, you need to be relentlessly and systematically curious. You should try to get interested in anything and everything that comes your way.” - Alberto Cairo, Knight Chair in Visual Journalism, University of Miami
Analyzing and visualizing data you’re not familiar with is a chance to learn something new, and it puts you in a position to ask ‘out of the box’ questions.
Doing your homework
It’s also essential to read any accompanying documentation for new datasets. If we read the documentation on the Lahman website, we find out that People contains “Player names, DOB, and biographical info.” The variables in People are presented below:
People table
playerID = A unique code assigned to each player. The playerID links the data in this file with records in the other files. birthYear = Year player was born
birthMonth = Month player was born
birthDay = Day player was born
birthCountry = Country where player was born
birthState = State where player was born
birthCity = City where player was born deathYear = Year player died
deathMonth = Month player died
deathDay = Day player died
deathCountry = Country where player died
deathState = State where player died
deathCity = City where player died
nameFirst = Player’s first name
nameLast = Player’s last name
nameGiven = Player’s given name (typically first and middle)
weight = Player’s weight in pounds
height = Player’s height in inches
bats = Player’s batting hand (left, right, or both)
throws = Player’s throwing hand (left or right)
debut = Date that player made first major league appearance
finalGame = Date that player made first major league appearance (blank if still active)
retroID = ID used by retrosheet
bbrefID = ID used by Baseball Reference website
Most of the data pre-processing steps center around a single question: Is this what I expected to see? Reading the documentation gives you expectations about the data to confirm or refute (and then investigate).
Now that we have some background information on this new dataset, we will look at how well People meets our expectations.
Whenever we get a new data source, we should try to view the data in its native format (if possible). We can view the raw data on the Github repository.
Load data
Fortunately, we are also able to load the raw data directly into R using the readr::read_csv() function. We will load the People dataset into R using readr::read_csv(), and assign "https://bit.ly/3scsHw7" to the file argument.
People <- readr::read_csv(file = "https://bit.ly/3scsHw7")step 4
- included in step4.md?
-
grammar
- spelling
Are we seeing what we expected? (1)
Before creating any visualizations, we want a display that gives us an overview of the entire People dataset. In the previous step, we went over some of the People dataset documentation, so we know what to expect.
Skimming data
We’ll be using the skimr package. skimr was is designed for:
“displaying summary statistics the user can skim quickly to understand their data”
Below we pass the People dataset to the skimr::skim() function to create PeopleSkim. We then use the base::summary() function to review the new object.
If this code looks unfamiliar to you, review the Introduction to ggplot2 scenario.
PeopleSkim <- People %>%
skimr::skim()
summary(PeopleSkim)| Name | Piped data |
| Number of rows | 20090 |
| Number of columns | 24 |
| _______________________ | |
| Column type frequency: | |
| character | 14 |
| Date | 2 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
The output above shows a high-level summary of all the variables in the People dataset. We can see there are 20090 rows and 24 columns (‘14’ columns are character's,2columns areDate’s, and 8 are numeric).
Viewing character variables
The new PeopleSkim object gives us summary information to check against the documentation and help guide our data visualizations. We will start by viewing the variables according to their types in People using skimr::yank() (read the function documentation on Github). The skim_type argument in skimr::yank() takes a variable type ("character", "numeric", or "Date").
Run the code below to use skimr::yank() to view a skim of the character variables in the People dataset.
PeopleSkim %>%
skimr::yank(skim_type = "character")Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| playerID | 0 | 1.00 | 5 | 9 | 0 | 20090 | 0 |
| birthCountry | 61 | 1.00 | 3 | 14 | 0 | 57 | 0 |
| birthState | 535 | 0.97 | 2 | 22 | 0 | 295 | 0 |
| birthCity | 174 | 0.99 | 3 | 26 | 0 | 4880 | 0 |
| deathCountry | 10235 | 0.49 | 3 | 14 | 0 | 25 | 0 |
| deathState | 10285 | 0.49 | 2 | 20 | 0 | 107 | 0 |
| deathCity | 10240 | 0.49 | 2 | 26 | 0 | 2676 | 0 |
| nameFirst | 37 | 1.00 | 2 | 14 | 0 | 2522 | 0 |
| nameLast | 0 | 1.00 | 2 | 14 | 0 | 10219 | 0 |
| nameGiven | 37 | 1.00 | 2 | 43 | 0 | 13325 | 0 |
| bats | 1180 | 0.94 | 1 | 1 | 0 | 3 | 0 |
| throws | 977 | 0.95 | 1 | 1 | 0 | 3 | 0 |
| retroID | 56 | 1.00 | 8 | 8 | 0 | 20034 | 0 |
| bbrefID | 2 | 1.00 | 5 | 9 | 0 | 20088 | 0 |
We can see none of these data are missing (n_missing and complete_rate). Skimr::skim() also shows us the min, max, empty, n_unique, and whitespace for the 14 character values.
Viewing date variables
Next, we use skimr::yank() to view a skim of the Date variables in the People dataset.
PeopleSkim %>%
skimr::yank(skim_type = "Date")Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| debut | 198 | 0.99 | 1871-05-04 | 2020-09-27 | 1965-04-14 | 10572 |
| finalGame | 198 | 0.99 | 1871-05-05 | 2020-09-27 | 1971-05-06 | 9480 |
The skim of the Date variables shows us which data are missing (n_missing and complete_rate), along with the earliest (min), latest (max), and middle (median).
The number of unique (n_unique) dates prints to the next line. This behavior is because the terminal window has a width limit. If the Terminal output extends past this limit, the content gets printed to the line below.
Do these numbers make sense?
We can use these values for sanity checks. For example, the n_unique for playerID matches the total number of rows in People, which we should expect from the documentation (playerID = “A unique code assigned to each player”). The earliest dates for both debut and finalGame are in May of 1871 (which corresponds to the first MLB game ever played).
step 5
- included in step5.md?
-
grammar
- spelling
Are we seeing what we expected? (2)
In the previous step, we viewed a skim() of the "character" and "Date" variables in the People dataset. We’re going to continue ‘skimming’ these data and check them against our expectations.
Viewing numeric variables
We’ll use skimr::yank() and skimr::focus() to view the n_missing and complete_rate for the "numeric" variables in People.
PeopleSkim %>%
focus(n_missing, complete_rate) %>%
yank("numeric")Variable type: numeric
| skim_variable | n_missing | complete_rate |
|---|---|---|
| birthYear | 114 | 0.99 |
| birthMonth | 282 | 0.99 |
| birthDay | 424 | 0.98 |
| deathYear | 10232 | 0.49 |
| deathMonth | 10233 | 0.49 |
| deathDay | 10234 | 0.49 |
| weight | 817 | 0.96 |
| height | 737 | 0.96 |
The complete_rate for birthYear, birthMonth, birthDay, weight and height are over 90%. However, the deathYear, deathMonth and height is under 50%. Why do you think these data have missing values?
The skim() output for the "numeric" variables give us a set of summary statistics:
Location statistics
- the
mean(or average) gives us the expected value for each variable
- the median (as
p50) or the ‘center’ value for each variable. Half of the values are above, and half are below.
PeopleSkim %>%
skimr::focus(numeric.mean, numeric.p50) %>%
skimr::yank("numeric") Variable type: numeric
| skim_variable | mean | p50 |
|---|---|---|
| birthYear | 1934.42 | 1942 |
| birthMonth | 6.63 | 7 |
| birthDay | 15.62 | 16 |
| deathYear | 1966.29 | 1968 |
| deathMonth | 6.49 | 6 |
| deathDay | 15.53 | 15 |
| weight | 187.80 | 185 |
| height | 72.34 | 72 |
Spread statistics
- the lowest value for each variable, or minimum (as
p0)
- the highest value for each variable, or maximum (as
p100)
- Together, these two values can give us the range, which is the difference between the maximum and minimum values
PeopleSkim %>%
skimr::focus(numeric.p0, numeric.p100) %>%
skimr::yank("numeric") Variable type: numeric
| skim_variable | p0 | p100 |
|---|---|---|
| birthYear | 1820 | 2000 |
| birthMonth | 1 | 12 |
| birthDay | 1 | 31 |
| deathYear | 1872 | 2020 |
| deathMonth | 1 | 12 |
| deathDay | 1 | 31 |
| weight | 65 | 2125 |
| height | 43 | 83 |
- the first quartile (as
p25), which is the ‘middle’ of the data points below the median
- the third quartile (as
p75), which is the ‘middle’ of the data points above the median
- Together, these two values can give us the interquartile range (IQR), which is the difference between the third and first quartiles
PeopleSkim %>%
skimr::focus(numeric.p25, numeric.p75) %>%
skimr::yank("numeric") Variable type: numeric
| skim_variable | p25 | p75 |
|---|---|---|
| birthYear | 1896 | 1973 |
| birthMonth | 4 | 10 |
| birthDay | 8 | 23 |
| deathYear | 1943 | 1993 |
| deathMonth | 3 | 10 |
| deathDay | 8 | 23 |
| weight | 172 | 200 |
| height | 71 | 74 |
- the standard deviation (as
sd), a measure of each variable’s disbursement. - The standard deviation describes how far a variable’s values are spread out around their mean
PeopleSkim %>%
skimr::focus(numeric.mean, numeric.sd) %>%
skimr::yank("numeric") Variable type: numeric
| skim_variable | mean | sd |
|---|---|---|
| birthYear | 1934.42 | 42.73 |
| birthMonth | 6.63 | 3.47 |
| birthDay | 15.62 | 8.76 |
| deathYear | 1966.29 | 32.95 |
| deathMonth | 6.49 | 3.53 |
| deathDay | 15.53 | 8.79 |
| weight | 187.80 | 26.33 |
| height | 72.34 | 2.61 |
These numbers can be challenging to make sense of by themselves. Fortunately, the skimr::skim() output comes with a hist column. The hist column is a small histogram for the numeric variables.
Below we use skimr::focus() and skimr::yank() to view the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in People.
PeopleSkim %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p50, numeric.p100,
numeric.hist) %>%
skimr::yank("numeric") Variable type: numeric
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| birthYear | 1934.42 | 42.73 | 1820 | 1942 | 2000 | ▁▅▅▆▇ |
| birthMonth | 6.63 | 3.47 | 1 | 7 | 12 | ▇▅▅▆▇ |
| birthDay | 15.62 | 8.76 | 1 | 16 | 31 | ▇▇▇▇▆ |
| deathYear | 1966.29 | 32.95 | 1872 | 1968 | 2020 | ▁▃▆▇▇ |
| deathMonth | 6.49 | 3.53 | 1 | 6 | 12 | ▇▅▅▅▇ |
| deathDay | 15.53 | 8.79 | 1 | 15 | 31 | ▇▇▇▆▆ |
| weight | 187.80 | 26.33 | 65 | 185 | 2125 | ▇▁▁▁▁ |
| height | 72.34 | 2.61 | 43 | 72 | 83 | ▁▁▁▇▁ |
The hist column shows us a miniature distribution of the values in each numeric variable.
Do these numbers make sense?
As we can see, the majority of the missing values are in the variables with the
deathprefix (deathDay,deathMonth, anddeathYear). The missing values in these variables make sense because, given the lowestbirthYearvalue (1820), we should expect approximately half of the baseball players in thePeopledataset to be still alive.We also notice an implausible value from the
skimroutput: theweightvariable maximum value (2125). We can usedplyr’sfilterandselectfunctions to find thenameGivenfor the abnormally highweightvalue.
People %>%
filter(weight == 2125) %>%
select(nameGiven, birthMonth, birthDay, birthYear, weight)#> # A tibble: 1 x 5
#> nameGiven birthMonth birthDay birthYear weight
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Jacob Robert 10 28 1996 2125
Google the player’s name. What is his listed weight on Wikipedia?
step 6
- included in step6.md?
-
grammar
- spelling
Counting things
“Data science is mostly counting things.” - Sam Firke
Data visualizations are drawings made with numbers. The challenge is picking the best image for the numbers you want to show. Before you can choose what you want to draw, you need to decide which numbers you’d like to display.
Column/bar charts
In a bar (or column) chart, each bar/column length represents a numeric value. The number of levels determines the number of bars or columns.
We will create a bar chart of the bats variable in People, which measures whether the player bats left-handed (L), right-handed (R), both (B), or if these data are missing (NA). Below we’ll use dplyr ’s count() function to tally the number of values for the different category items in bats.
People %>%
count(bats, sort = TRUE)#> # A tibble: 4 x 2
#> bats n
#> <chr> <int>
#> 1 R 12435
#> 2 L 5246
#> 3 B 1229
#> 4 <NA> 1180
In ggplot2, we create a bar chart using the geom_bar() function. First we map bats to both x and fill inside the ggplot(aes()) functions. If you need a refresher on ggplot2 layers and mapping, check out the previous scenario.
We also remove the legend with guides(fill = FALSE), and add labels for title, subtitle, caption, and y axis (x is set to NULL).
# click to execute code
gg_step6_bar_01 <- People %>%
ggplot(aes(x = bats, fill = bats)) +
geom_bar() +
guides(fill = FALSE) +
labs(title = "MILB Player's batting hand",
subtitle = "Left (L), right (R), or both (B)",
caption = "source: http://www.seanlahman.com/",
x = NULL, y = "Number of birth countries")
# save
# ggsave(plot = gg_step6_bar_01,
# filename = "gg-step6-bar-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step6_bar_01We will need to open the gg-step6-bar-01.png graph in the vscode IDE (above the Terminal console).
Counting top 10 birth countries
geom_bar() only takes a single categorical (or factor) variable. Sometimes we’ll need to specify the variable we want to map to the y axis. For example, the code below uses dplyr::filter() and dplyr::count() to get return top 10 non-US ‘Country of birth’ for players in the People dataset. We then use utils::head() to return the top 10 rows, and dplyr::mutate() with forcats::fct_inorder() to change the format of the birthCountry variable to a factor.
People %>%
filter(birthCountry != "USA" & !is.na(birthCountry)) %>%
count(birthCountry, sort = TRUE) %>%
head(10) %>%
mutate(birthCountry = fct_inorder(birthCountry))#> # A tibble: 10 x 2
#> birthCountry n
#> <fct> <int>
#> 1 D.R. 791
#> 2 Venezuela 425
#> 3 P.R. 270
#> 4 CAN 255
#> 5 Cuba 223
#> 6 Mexico 136
#> 7 Japan 70
#> 8 Panama 65
#> 9 United Kingdom 52
#> 10 Ireland 50
In this case, we have two variables in this output: birthCountry and ‘n’. If we’re going to build a graph from these data, we know we’ll need a way to represent both the country’s name and the number of times it occurs.
For this graph, we will need to use ggplot2'sgeom_col()` function (we will include the code from above, which creates a dataset with only the top 10 non-US birth countries).
We also map birthCountry to fill, remove the legend with guides(fill = FALSE), and add a label for the y axis with ggplot2::labs().
# click to execute code
gg_step6_col_01 <- People %>%
filter(birthCountry != "USA" & !is.na(birthCountry)) %>%
count(birthCountry, sort = TRUE) %>%
head(10) %>%
mutate(birthCountry = fct_inorder(birthCountry)) %>%
ggplot(aes(x = birthCountry, y = n, fill = birthCountry)) +
geom_col() +
guides(fill = FALSE) +
labs(title = "Top 10 Non-US birth countries for MLB players",
subtitle = "Based on birthCountry",
caption = "source: http://www.seanlahman.com/",
x = NULL, y = "Number of birth countries")
# save
# ggsave(plot = gg_step6_col_01,
# filename = "gg-step6-col-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step6_col_01We will need to open the gg-step6-col-01.png graph in the vscode IDE (above the Terminal console).
Flip the coordinates
If we find the x axis gets cluttered and difficult to read, we can pivot the columns’ display with the ggplot2::coord_flip() function. Note that we also need to change the forcats function in mutate() to fct_reorder().
# click to execute code
gg_step6_col_02 <- People %>%
filter(birthCountry != "USA" & !is.na(birthCountry)) %>%
count(birthCountry, sort = TRUE) %>%
head(10) %>%
# reorder the birthCountry by n
mutate(birthCountry = fct_reorder(birthCountry, n)) %>%
ggplot(aes(x = birthCountry, y = n,
fill = birthCountry)) +
geom_col() +
guides(fill = FALSE) +
# flip coordinates
coord_flip() +
labs(title = "Top 10 Non-US birth countries for MLB players",
subtitle = "Based on birthCountry",
caption = "source: http://www.seanlahman.com/",
x = NULL, y = "Number of birth countries")
# save
# ggsave(plot = gg_step6_col_02,
# filename = "gg-step6-col-02.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step6_col_02We will need to open the gg-step6-col-02.png graph in the vscode IDE (above the Terminal console).
Communication tips
Bar charts and column charts display the counts for each different response in the categorical variable. We can help our audience interpret these graphs by always setting the count axis scale to zero, sorting the chart’s values to make it easier to read, and flipping the x axis if it is difficult to read.
step 7
- included in step7.md?
-
grammar?
- spelling?
Single variable distributions (1)
The skimr output displayed a small histogram for each numeric variable in the People dataset in the previous steps. Histograms are a special kind of bar/column chart–they show the distribution for numeric variables by ‘binning’ each response into a set number of bars/columns.
Load data
These data come from the TidyTuesday project, a data repository who’s intent is
“to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions.”
We’re going to use a dataset of Ramen ratings from The Ramen Rater. Read more about these data here.
Below we import the raw data from an external .csv file into Ramen and get a skimr::skim() summary (stored in RamenSkim)
Ramen <- readr::read_csv("https://bit.ly/38sO0S7")
RamenSkim <- skimr::skim(Ramen)Review data
View the character variables in RamenSkim
RamenSkim %>%
skimr::yank("character")Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| brand | 0 | 1 | 1 | 34 | 0 | 456 | 0 |
| variety | 0 | 1 | 3 | 96 | 0 | 2971 | 0 |
| style | 2 | 1 | 3 | 10 | 0 | 8 | 0 |
| country | 0 | 1 | 2 | 13 | 0 | 44 | 0 |
How complete are these data?
View the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in Ramen.
RamenSkim %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p50, numeric.p100,
numeric.hist) %>%
skimr::yank("numeric")Variable type: numeric
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| review_number | 1590.06 | 917.94 | 1 | 1590.00 | 3180 | ▇▇▇▇▇ |
| stars | 3.69 | 1.03 | 0 | 3.75 | 5 | ▁▁▂▇▅ |
Pay attention to the hist column for stars–it shows the distribution for the values. Where are most of the values concentrated?
We will investigate the distribution of stars by building a histogram with ggplot2.
Build a histogram
We’re going to use ggplot2::geom_histogram() to view the distribution the stars variable in Ramen. Note that we are also assigning labels to the graph that includes 1) a clear title, 2) descriptive information about the graph, 3) the source of the data.
# click to execute code
gg_step7_hist_01 <- Ramen %>%
ggplot(aes(x = stars)) +
geom_histogram() +
labs(
title = "Distribution of ramen stars",
subtitle = "bins = 30",
caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step7_hist_01,
# filename = "gg-step7-hist-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step7_hist_01We will need to open the gg-step7-hist-01.png graph in the vscode IDE (above the Terminal console).
As we stated above, histograms stack the variable values into a defined set of bins. The default number for bins is 30. We can change the shape of the histogram by changing the bins argument.
Run the code below to see how the distribution looks with 20 bins. Note we also include the color = "white" argument to ensure we can see each bar separately.
# click to execute code
gg_step7_hist_02 <- Ramen %>%
ggplot(aes(x = stars)) +
geom_histogram(bins = 20, color = "white") +
labs(
title = "Distribution of ramen stars",
subtitle = "bins = 20",
caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step7_hist_02,
# filename = "gg-step7-hist-02.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step7_hist_02Open the gg-step7-hist-02.png graph in the vscode IDE (above the Terminal console).
The stars values fit into 20 bins better than the default 30 because we can see where values are concentrated (and the high frequency of 5-star ratings).
step 8
- included in step8.md?
- grammar?
- spelling?
Single variable distributions (2)
The previous step demonstrated how to use a histogram to view the distribution of a single variable. We needed to adjust the bins in the histogram to make its shape easier to interpret. Density plots use kernel smoothing to create cleaner distributions.
Build a density plot
We’re going to use ggplot2::geom_density() to view a density plot of the stars variable in Ramen. We will use fill to color the area underneath the density line with "dodgerblue".
# click to execute code
gg_step8_density_01 <- Ramen %>%
ggplot(aes(x = stars)) +
geom_density(fill = "dodgerblue") +
labs(title = "Distribution of ramen stars",
caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step8_density_01,
# filename = "gg-step8-density-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step8_density_01Open the gg-step8-density-01.png graph in the vscode IDE (above the Terminal console).
Adding useful labels
Although density plots create a much smoother distribution, the y axis is harder to interpret. To overcome this, we will add two summary statistics programmatically to the labels using the base::paste0() and base::round() functions.
Run the code below to see how this works:
# click to execute code
subtitle_dens_stars <- paste0("Star rating (mean +/- sd): ",
# use round() to make sure there are only two decimals
round(mean(Ramen$stars, na.rm = TRUE), 2),
" +/- ",
round(sd(Ramen$stars, na.rm = TRUE), 2))
subtitle_dens_stars#> [1] "Star rating (mean +/- sd): 3.69 +/- 1.03"
We can now supply subtitle_dens_stars to the labs(subtitle = ) function.
Creating labels this way ensures they are updated whenever the underlying data change.
# click to execute code
gg_step8_density_02 <- Ramen %>%
ggplot(aes(x = stars)) +
geom_density(fill = "dodgerblue") +
labs(title = "Distribution of ramen stars",
# combine text with mean() and sd() for stars in Ramen
subtitle = subtitle_dens_stars,
# include source
caption = "source: https://www.theramenrater.com/resources-2/the-list/")
# save
# ggsave(plot = gg_step8_density_02,
# filename = "gg-step8-density-02.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step8_density_02Open the gg-step8-density-02.png graph in the vscode IDE (above the Terminal console).
As we’ve said, an essential part of effective communication is knowing your audience. It’s unlikely these exploratory graphs will be part of our final deliverable, so the audience for these graphs will likely be us!
Using descriptive labels makes sure we know what we’re seeing when we’re viewing our graphs.
step 9
- included in step9.md?
- grammar?
- spelling?
Multiple variable distributions (1)
We’ve looked at the distribution of all the values in the stars variable, but what if we were interested in the distribution of stars across the groups in another categorical variable, like style, which is the Style of container (cup, pack, tray, etc.).
We can check the levels of style with dplyr::count()
Ramen %>%
count(style, sort = TRUE)#> # A tibble: 9 x 2
#> style n
#> <chr> <int>
#> 1 Pack 1832
#> 2 Bowl 612
#> 3 Cup 559
#> 4 Tray 138
#> 5 Box 32
#> 6 Restaurant 3
#> 7 <NA> 2
#> 8 Bar 1
#> 9 Can 1
The output above tells us the top five most common reviews for Ramen came from Packs, Bowls, Cups, Trays, and Boxes.
Grouped skims
We can use dplyrs filter, select, and group_by functions with skimr to see the distribution of the stars variable across the five most common style levels.
# click to execute code
Ramen %>%
# filter to most common styles
filter(style %in% c("Pack", "Bowl",
"Cup", "Tray", "Box")) %>%
# select only stars and style
select(stars, style) %>%
# group dataset by style
group_by(style) %>%
# skim grouped data
skim() %>%
# focus on select output
skimr::focus(n_missing, style,
numeric.mean, numeric.sd, numeric.hist,
numeric.p0, numeric.p50, numeric.p100) %>%
# only return numeric values
skimr::yank("numeric") Variable type: numeric
| skim_variable | style | n_missing | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|---|---|
| stars | Bowl | 0 | 3.71 | 1.05 | 0 | 3.75 | 5 | ▁▁▂▇▆ |
| stars | Box | 0 | 4.21 | 1.29 | 0 | 5.00 | 5 | ▁▁▁▂▇ |
| stars | Cup | 0 | 3.49 | 1.04 | 0 | 3.50 | 5 | ▁▁▂▇▃ |
| stars | Pack | 14 | 3.74 | 0.99 | 0 | 3.75 | 5 | ▁▁▂▇▆ |
| stars | Tray | 0 | 3.60 | 1.14 | 0 | 3.75 | 5 | ▁▁▂▇▆ |
The output shows Ramen from a Box has the highest stars rating. We are going to confirm this with a ridgeline plot.
The ggridges package
The mean and median (p50) in the skimr output tells us the distribution of stars varies slightly for the filtered levels of style, so we will view the density for each distribution with a ridgeline plot from the ggridges package.
Install and load ggridges below:
# click to execute code
install.packages("ggridges")
library(ggridges)Build labels first!
We’ll build the labels for this graph first in labs_ridge_stars_style, so we know what we’re expecting to see.
# click to execute code
labs_ridge_stars_style <- labs(
title = "Star ratings by style",
subtitle = "Star rating across most common ramen containers",
caption = "source: https://www.theramenrater.com/resources-2/the-list/",
x = "Star rating",
y = NULL) I’ve found this practice to be very helpful for conceptualizing graphs before I begin building them, which reduces errors and saves time!
Overlapping density plots
The code below uses ggridges::geom_density_ridges() function to build overlapping density plots. In this plot, we map the fill argument to the style variable. We also want to set the guides(fill = ) to FALSE because we’ll have labels on the graph for each level of style.
# # click to execute code
gg_step9_ridge_01 <- Ramen %>%
# filter to most common styles
filter(style %in% c("Pack", "Bowl",
"Cup", "Tray", "Box")) %>%
ggplot(aes(x = stars,
y = style,
fill = style)) +
geom_density_ridges() +
guides(fill = FALSE) +
# add labels
labs_ridge_stars_style
# # save
# ggsave(plot = gg_step9_ridge_01,
# filename = "gg-step9-ridge-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step9_ridge_01Open the gg-step9-ridge-01.png graph in the vscode IDE (above the Terminal console).
We can see that the stars ratings for the Box level in style are concentrated around 5 from the ridgeline plot.
step 10
- included in step10.md?
- grammar?
- spelling?
Multiple variable distributions (2)
In the last step, we learned the distribution for Ramen stars ratings varied across the five most common levels of style. This step will view the variation of stars across style with a box-plot. Box-plots are a great way of viewing the summary statistics for a numeric variable (like stars) across multiple levels of a categorical variable (like style).
Box-plot labels
We’ll build the labels for the graph similar to the labels we used for the ridgeline plot, but we’ll be a little more explicit with the subtitle and x axis.
# click to execute code
labs_box_stars_style <- labs(
title = "Star ratings by style",
subtitle = "Star ratings across pack, bowl, cup, tray, and box containers",
caption = "source: https://www.theramenrater.com/resources-2/the-list/",
x = "Ramen star ratings",
y = NULL) Building box-plots
We’ll filter the data to the five most common style’s again and map stars to the x axis and style to the y axis. We will also map style to the fill aesthetic inside ggplot2::geom_boxplot().
We don’t need a guide (or legend), so we will remove it with guides(fill = FALSE).
gg_step10_boxplot_01 <- Ramen %>%
# filter to most common styles
filter(style %in% c("Pack", "Bowl",
"Cup", "Tray", "Box")) %>%
ggplot(aes(x = stars, y = style)) +
geom_boxplot(aes(fill = style)) +
guides(fill = FALSE) +
labs_box_stars_style
# save
# ggsave(plot = gg_step10_boxplot_01,
# filename = "gg-step10-boxplot-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step10_boxplot_01Open the gg-step10-boxplot-01.png graph in the vscode IDE (above the Terminal console).
In the next step, we will cover how to interpret the contents of a box-plot.
step 11
- included in step11.md?
- grammar?
- spelling?
Multiple variable distributions (3)
Box plots display many of the numbers we see in the skimr output.
Contents of a box-plot
Click on the code below to create a skimr summary of Ramen stars ratings in `Tray’s.
# click to execute code
Ramen %>%
# filter to most common styles
filter(style == "Tray") %>%
# select only stars and style
select(stars, style) %>%
# group dataset by style
group_by(style) %>%
# skim grouped data
skimr::skim() %>%
# focus on select output
skimr::focus(style,
numeric.p0, numeric.p25, numeric.p50,
numeric.p75, numeric.p100, numeric.hist) %>%
# only return numeric values
skimr::yank("numeric") Variable type: numeric
| skim_variable | style | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|
| stars | Tray | 0 | 3 | 3.75 | 4.25 | 5 | ▁▁▂▇▆ |
We can calculate the interquartile range using dplyr below:
# click to execute code
Ramen %>%
# filter to most common styles
filter(style == "Tray") %>%
# select only stars and style
select(stars, style) %>%
# group dataset by style
group_by(style) %>%
# summarize IQR
summarize(
`Stars/Tray IQR` = IQR(stars, na.rm = TRUE))#> # A tibble: 1 x 2
#> style `Stars/Tray IQR`
#> * <chr> <dbl>
#> 1 Tray 1.25
The figure below shows a box-plot for the distribution of stars ratings across the Tray level of style. We’ve labeled the summary statistics from the skimr output on the graph. The 25th percentile (or p25) is the box’s first vertical line (or hinge). The 50th percentile (or p50) is the median and middle line of the box, and the 75th percentile (p75) is the last vertical line in the box (or hinge).
We’ve also labeled the minimum (p0) and maximum (p100) values and the interquartile range (which is similar to the standard deviation).
Communication tip
If your audience is not familiar with box-plots, the figure above is an example of supporting information to include. It explains how to read the graph, using an example from the finished chart. However, using complex plots adds mental labor to your audience and can take attention away from the point you’re trying to make. Effective communication means always using the most straightforward (or most common) graphs to reveal your findings.
step 12
- included in step12.md?
- grammar?
- spelling?
Communicating column/bar charts
Now that we’ve created a few graphs, we should stop and consider what narrative information we’ve contained in each plot. We used column and bar charts for displaying the counts for two categorical variables:
bats, which measures whether MLB players are left-handed (L), right-handed (R), both (B), or if these data are missing (NA)
birthCountry, which tells us the ‘Country of Birth’ for each MLB player
We can use these graphs to convey comparison information. For example, we can see from the bar graph that most MLB players are right-handed batters.
Bar graph
The column chart displays similar information. The columns can be used for comparing the frequency of one country to the other. We’ve also arranged the columns in a way that makes it clear which country appears the most and which country appears the least.
Column graph
The text to accompany your graphs will largely depend on the context of the problem you’re trying to solve (or question you’re trying to answer), but there are a few general guidelines we can apply to each type of graph.
Communcation (labels)
Titles should be objective and neutral, expressing the “who,” “what,” and “where” of the figure’s measurements. Avoid jargon and unnecessary descriptive words. Stick with 1) what data was measured, 2) when the data was measured, and 3) how the data was counted (i.e., the units).
When you are building labels, plan on providing enough information that the chart becomes a ‘stand-alone product.’ By this, we mean that if a new observer viewed your graph, they would at minimum be able to understand what point the figure was trying to make (i.e., “this graph shows the values in X variable,” or “this figure shows the relationship between X and Y”).
Communication (distributions)
We’ve explored the distribution of the stars rating variable in the Ramen dataset by itself in the histogram and density plot and across the top five types in the ridgeline plot and box-plot.
Histogram
Density graph
Ridgeline plot
Box-plot
Distribution plots are useful if we’re answering exploratory questions about a variable before calculating statistics or building a model. Histograms, density, and ridgeline plots can quickly tell us if a variable has a normal (bell-shaped) distribution, which is a crucial assumption to check before modeling. Box-plots require a higher degree of statistical literacy for interpretation, so we recommend confirming your audience is familiar with these graphs before relying on them. Summary statistics are also vital to include with distribution graphs (usually in a supplementary table) because it tethers the figure to mathematical values.
The information from these exploratory charts gives your narrative context and frames the problem. If we were telling a story, this would be the portion that tells us the setting or universe in which our characters live.
step 13
- included in step13.md?
- grammar?
- spelling?
Visualizing relationships (1)
Now that we know how to explore variable distributions, we will look at relationships between two (or more) variables. We must establish want kind of relationship we’re investigating before deciding what plot to make. We’ve already been exploring the relationship between two variables. We looked at the distribution (or spread) of stars vs. five categories of style in the Ramen dataset, and we also looked at the birthCountrys vs. n (the count of each birth country) in the People dataset.
In this step, we will look at how two numeric variables change (or vary) and if that change is in the same (or opposite) direction. One of the most common graphs for visualizing this relationship between two numerical variables is the scatterplot. A scatterplot uses points to display two numeric variables, with one variable on each axis.
Star Wars data
We will load the starwars data from the dplyr package. These data come from the Star Wars API. Read more about this dataset on dplyrs website.
# click to execute code
StarWars <- dplyr::starwars We will look at the relationship between height and mass for characters in the StarWars dataset. Let’s start by looking at the n_missing and complete_rate for these two variables.
# click to execute code
StarWars %>%
select(height, mass) %>%
skimr::skim() %>%
skimr::focus(n_missing, complete_rate) %>%
skimr::yank("numeric")Variable type: numeric
| skim_variable | n_missing | complete_rate |
|---|---|---|
| height | 6 | 0.93 |
| mass | 28 | 0.68 |
We can see that almost 2/3rds of the data in mass are missing. This amount of missing data might surprise us if we didn’t explore it before plotting.
We will also view the mean, standard deviation (sd), minimum (p0), 25th percentile (p25), median (p50), 75th percentile (p75), maximum (p100), and hist for these two columns.
# click to execute code
StarWars %>%
select(height, mass) %>%
skimr::skim() %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p25,
numeric.p50, numeric.p75,
numeric.p100, numeric.hist) %>%
skimr::yank("numeric") Variable type: numeric
| skim_variable | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|
| height | 174.36 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▁▁▇▅▁ |
| mass | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
We can see at least one value of mass that is considerably higher than the rest. We can tell because the location statistics are similar to each other (mean = 97.3, median (p50) = 84.5), but the spread is almost twice the value of the location (sd = 169). The maximum value (p100) of 1358 also confirms this finding. What is going on with this value?
Investigate outliers
It’s always a good idea to investigate values that seem implausible (like we did with the abnormally high weight for the MLB player in step 5). If we can’t figure out what is going on, we should communicate this with our stakeholders. Outliers can have a big impact on data visualizations (and statistical models), so ensuring we account for them is essential for communicating with our audience.
We’re going to filter the StarWars data only observations with mass more than 180, and select only the name, height, mass, sex and species columns (we chose 200 because it’s approximately 2x the p75 value).
StarWars %>%
filter(mass > 200) %>%
select(name, height, mass, sex, species)#> # A tibble: 1 x 5
#> name height mass sex species
#> <chr> <int> <dbl> <chr> <chr>
#> 1 Jabba Desilijic Tiure 175 1358 hermaphroditic Hutt
We can now see this mass belongs to Jabba the Hutt, which makes sense if we do some additional research.
Labels
Now that we know what we’re going to visualize, we can make our labels.
# click to execute code
labs_scatter_ht_mass_01 <- labs(
title = "Star Wars Character's height and mass",
x = "Mass",
y = "Height")Scatterplots
We will create a scatterplot with ggplot2::geom_point() by mapping mass to the x axis and map height to the y axis.
# click to execute code
gg_step13_scatter_01 <- StarWars %>%
ggplot(aes(x = mass, y = height)) +
geom_point() +
# add labels
labs_scatter_ht_mass_01
# save
# ggsave(plot = gg_step13_scatter_01,
# filename = "gg-step13-scatter-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step13_scatter_01Open the gg-step13-scatter-01.png graph in the vscode IDE (above the Terminal console). Notice how the points are all clustered on the left-hand side of the chart? The x axis has extended to account for Jabba the Hutt’s high mass value, which has made it hard to interpret the relationship among the other values.
What happens when we remove the outliers?
If you encounter outliers, it’s a good idea to view each graph with and without them to see how much they influence the plot. Let’s filter Jabba the Hutt’s mass out of the Starwars dataset and rebuild the scatterplot. We’ll also add this information to a new set of labels, so we don’t get the two graphs confused.
# click to execute code
# new labels
labs_scatter_ht_mass_02 <- labs(
title = "Star Wars Character's height and mass",
subtitle = "Characters with mass less than 200",
x = "Mass",
y = "Height")
# build graph
gg_step13_scatter_02 <- StarWars %>%
filter(mass < 200) %>%
ggplot(aes(x = mass, y = height)) +
geom_point() +
# add labels
labs_scatter_ht_mass_02
# save
# ggsave(plot = gg_step13_scatter_01,
# filename = "gg-step13-scatter-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step13_scatter_02Open the gg-step13-scatter-02.png graph in the vscode IDE (above the Terminal console).
Based on the scatterplot, we can see a positive relationship between mass and height for Star Wars characters. But is this the same for all types of characters? For example, does this relationship hold for all levels of gender?
step 14
- included in step14.md?
- grammar?
- spelling?
Visualizing relationships (2)
We will continue looking at the relationship between mass and height in the Starwars dataset. We looked at mass and height with and without an outlier’s influence in the previous step. In this step, we will add a categorical variable (gender) to the plot to see if the direction of the change for mass and height is the same for all levels of gender.
Counting with tabyls
Let’s view the count of gender below using the tabyl() function from the janitor package.
# click to execute code
install.packages("janitor")
library(janitor)janitor::tabyl() works similar to dplyr::count(), but automatically prints a bit more information in the output. Click on the code block below to create a tably for the gender variable.
# click to execute code
StarWars %>%
janitor::tabyl(gender) #> gender n percent valid_percent
#> feminine 17 0.19540230 0.2048193
#> masculine 66 0.75862069 0.7951807
#> <NA> 4 0.04597701 NA
We can see the standard output produces a percent and valid_percent columns. We can also add percent formatting with janitor::adorn_pct_formatting():
# click to execute code
StarWars %>%
janitor::tabyl(gender) %>%
janitor::adorn_pct_formatting()#> gender n percent valid_percent
#> feminine 17 19.5% 20.5%
#> masculine 66 75.9% 79.5%
#> <NA> 4 4.6% -
This output tells us 4 characters in the StarWars dataset will not show up if we use the gender variable. Read more about the tabyl function options here.
Scatterplot (3 variables)
One way to include the gender variable in the scatterplot is to map it to the color aesthetic. The output from tabyl tells us there are 4 values in gender that will be missing from this plot.
We will update our labels and add gender to the scatterplot in the code below.
# click to execute code
labs_scatter_ht_mass_gender <- labs(
title = "Star Wars Character's gender, height and mass",
subtitle = "Data for gender (feminine/masculine), height, and mass < 200",
x = "Mass",
color = "Gender",
y = "Height")
gg_step14_scatter_01 <- StarWars %>%
filter(!is.na(gender) & mass < 200) %>%
ggplot(aes(x = mass, y = height, color = gender)) +
geom_point() +
# add labels
labs_scatter_ht_mass_02
# save
# ggsave(plot = gg_step14_scatter_01,
# filename = "gg-step14-scatter-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step14_scatter_01Open the gg-step14-scatter-01.png graph in the vscode IDE (above the Terminal console).
The color of the points shows that the feminine characters occupy a smaller range of values for the relationship between mass and height.
step 15
- included in step13.md?
- grammar?
- spelling?
Visualizing relationships (3)
Sometimes we will want to look at how a particular measurement changes over time or trends. When we’re visualizing trends, the x axis typically the measure of time, and the y axis contains our measurement of interest. Each time point along the x axis has a corresponding value on the y axis, and lines connect these points. These lines are extended along the x axis’s full scale to display the change over time (or the trend). See the example from the FiveThirtyEight article titled, “Comic Books Are Still Made By Men, For Men And About Men”:
We’re going to re-create this chart using data from the fivethirtyeightdata package.
Comic Book Data
The fivethirtyeightdata package contains data from the FiveThirtyEight Github repository, but these data have been formatted to provide “tame data principles for introductory statistics and data science courses.”
We are going to load the comic_characters dataset from the article above. We’re only interested in a subset of this dataset, so we select the relevant variables and do some initial formatting steps before assigning them to the ComicData (read more about the data here).
# click to execute code
ComicData <- read_csv("https://bit.ly/3oS1zQY") # subset data
ComicData <- ComicData %>%
# select only publisher, name, sex, year, and date
select(publisher, name, sex, year, date) %>%
# filter to only the rows containing either male or female characters
filter(sex %in% c("Female Characters", "Male Characters")) %>%
# convert these two variables to factors
mutate(sex = factor(sex,
levels = c("Female Characters",
"Male Characters")),
publisher = factor(publisher,
levels = c("Marvel", "DC"))) %>%
# remove all missing values
drop_na()
# view
glimpse(ComicData)#> Rows: 21,408
#> Columns: 5
#> $ publisher <fct> DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC, DC,…
#> $ name <chr> "Batman (Bruce Wayne)", "Superman (Clark Kent)", "Green Lan…
#> $ sex <fct> Male Characters, Male Characters, Male Characters, Male Cha…
#> $ year <dbl> 1939, 1986, 1959, 1987, 1940, 1941, 1941, 1989, 1969, 1956,…
#> $ date <date> 1939-05-01, 1986-10-01, 1959-10-01, 1987-02-01, 1940-04-01…
We formatted two variables in ComicData as factor's. We will useskimr::skim()to get an overview ofpublisherandsex`.
Factors
Factor variables are unique kinds of qualitative or categorical variables in R because they have a “fixed and known set of possible values.”. We assigned these values with the levels argument.
The skimr output below shows us the two new factor variables we’ve created in ComicNewFemalePerc.
# click to execute code
ComicData %>%
skimr::skim() %>%
skimr::yank("factor") Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| publisher | 0 | 1 | FALSE | 2 | Mar: 14728, DC: 6680 |
| sex | 0 | 1 | FALSE | 2 | Mal: 15834, Fem: 5574 |
The top_counts column tells us the counts for both levels in publisher and sex.
Summarizing data
To recreate the graph above, we’ll need to summarize the ComicData data. We need year represented on the x axis, and the percentage of female comic book characters for each publisher represented on the y axis. We can do this with dplyrs group_by(), summarize(), mutate() and ungroup() functions.
The code below creates two new variables:
sex_n_per_yr_pub, which is the count of each level ofsexper year andpublisher, and
sex_pct_per_yr_pub, which is the percentage of each level of sex peryearandpublisher
We also filter the year variable to only data between 1980 and 2010.
# click to execute code
ComicNewFemalePerc <- ComicData %>%
group_by(year, publisher, sex) %>%
summarize(sex_n_per_yr_pub = sum(n())) %>%
group_by(year, publisher) %>%
mutate(sex_pct_per_yr_pub = sex_n_per_yr_pub / sum(sex_n_per_yr_pub),
sex_pct_per_yr_pub = round(sex_pct_per_yr_pub, digits = 3)) %>%
ungroup() %>%
filter(year > 1979 & year < 2011)
head(ComicNewFemalePerc)#> # A tibble: 6 x 5
#> year publisher sex sex_n_per_yr_pub sex_pct_per_yr_pub
#> <dbl> <fct> <fct> <int> <dbl>
#> 1 1980 Marvel Female Characters 60 0.243
#> 2 1980 Marvel Male Characters 187 0.757
#> 3 1980 DC Female Characters 10 0.278
#> 4 1980 DC Male Characters 26 0.722
#> 5 1981 Marvel Female Characters 58 0.271
#> 6 1981 Marvel Male Characters 156 0.729
Labels
We will build labels identical to those in the FiveThirtyEight graph but include the original article’s URL as a caption.
# click to execute code
labs_line_comics <- labs(
title = "Comics Aren't Gaining Many Female Characters",
subtitle = "Percentage of new characters who are female",
caption = "https://fivethirtyeight.com/features/women-in-comic-books/",
x = NULL,
y = NULL)Line graph
Now that we have our data and labels, we can build the line graph. First, we need to filter the data to only female percentages, then pass the filtered data to ggplot(aes()), mapping the year to the x axis and sex_pct_per_yr_pub to the y axis.
On the next layer, inside the ggplot2::geom_line() function, we map publisher to the group and color aesthetics inside the aes() function. Outside the aes() function, we make the lines larger by setting the size to 2.
We need to make a few more customizations to this graph to make it look like the one in the article:
Note the FiveThirtyEight graph does not have a legend or guide. We can remove the legend by adding a
ggplot2::theme(legend.position = "none")layer.The
yaxis in the FiveThirtyEight graph ranges from0to50and is formatted as a percent (%). Displaying percentage units helps the audience understand that a proportion is displayed (not the raw counts). We can change the formatting on theyaxis with theggplot2::scale_y_continuous()function. Set thelimitsargument toc(0.00, 0.50)), and thelabelsargument toscales::percent_format(accuracy = 1).
# click to execute code
gg_step15_line_01 <- ComicNewFemalePerc %>%
filter(sex == "Female Characters") %>%
ggplot(aes(x = year, y = sex_pct_per_yr_pub)) +
geom_line(aes(group = publisher, color = publisher),
size = 2) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(0.00, 0.50),
labels = scales::percent_format(accuracy = 1)) +
# add labels
labs_line_comics
# save
# ggsave(plot = gg_step15_line_01,
# filename = "gg-step15-line-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step15_line_01Open the gg-step15-line-01.png graph in the vscode IDE (above the Terminal console).
Our graph is starting to look like the figure in the article, but we still need to make a few changes. We removed the legend, so we have no way of knowing which line belongs to which publisher (DC or Marvel). In the next section, we will learn how to add these labels onto the graph near their respective lines.
step 16
- included in step16.md?
- grammar?
- spelling?
Adding text annotations
In the previous step, we started recreating the following graph from the FiveThirtyEight article titled, “Comic Books Are Still Made By Men, For Men And About Men”.
We left off with the line graph in the gg-step15-line-01.png file. The FiveThirtyEight graph designers chose to remove the legend and add text labels directly to the chart, helping communicate the trend to audiences. This practice isn’t always the best choice, but it works in this graph for a few reasons, which we will cover below.
When to add text to graphs
Text on graphs is essential to communicating the graph’s contents. After all, a figure without text is just a drawing. We’ve already covered what information to include in the graph labels (i.e., the title, subtitle, caption, and x and y axes). Annotations are additional text we place in the plot area to help the audience understand what we’re trying to show.
We recommend adding text annotations as long as they aren’t obscuring or distracting the audience from the point you’re trying to make.
In the FiveThirtyEight graph, the text annotations work because there are only two levels for the measure of interest (DC and Marvel). Encoding this information in color aesthetics is helpful because we can set them to contrasting hues (i.e., blue and red). If there were more publisher levels, the color aesthetic might not be the best choice.
Lastly, the text annotations highlight the graph title’s information (“Comics Aren’t Gaining Many Female Characters”). Both annotations are in the same general area on the x axis, and their placement on the y axis is far enough from the line that the text doesn’t overlap.
Building text annotations
To get our graph closer to the FiveThirtyEight figure, we’re going to set the line colors with ggplot2::scale_color_manual(). This function takes a values argument, which we will fill with two colors, c("firebrick3", "dodgerblue").
In ggplot2, we can add text annotations with the annotate() function. Each level of publisher gets a layer and requires the following arguments:
geom = "text"this letsannotate()know we want text
x =this is the position on thexaxis we want the annotation
y =this is the position on theyaxis we want the annotation
label =this is the text we want to use for the annotation
size = 7the size of the text
color =the same values we supplied toscale_color_manual()
# click to execute code
gg_step16_line_01 <- ComicNewFemalePerc %>%
filter(sex == "Female Characters") %>%
ggplot(aes(x = year, y = sex_pct_per_yr_pub)) +
geom_line(aes(group = publisher, color = publisher),
size = 2) +
scale_y_continuous(limits = c(0.00, 0.50),
labels = scales::percent_format(accuracy = 1)) +
# color the axis
scale_color_manual(values = c("firebrick3",
"dodgerblue")) +
# add text annotations
annotate(geom = "text", x = 2001, y = .47,
label = "DC", size = 7,
color = "dodgerblue") +
annotate(geom = "text", x = 2002, y = .25,
label = "Marvel", size = 7,
color = "firebrick3") +
# add labels
labs_line_comics
# save
# ggsave(plot = gg_step16_line_01,
# filename = "gg-step16-line-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step16_line_01Open the gg-step16-line-01.png graph in the vscode IDE (above the Terminal console).
We’re almost there! We will make a few final adjustments to this graph to change the fonts displayed on the title and x and y axes.
Adjusting font attributes
The title font in the FiveThirtyEight graph is bold, and the subtitle text size is approximately the same size as the text on the x and y axes. We can adjust the font attributes in a ggplot2 graph with the theme() function. We’ve already used the theme() function in the previous step to remove the legend (legend.position = "none").
The ggplot2::theme() function comes with many options for customizing the way our graph looks. We will only cover a small number of these, and you should read more about using ggplot2 in the excellent R Graphics Cookbook, 2nd Edition by Winston Chang.
To change the font text attributes, we will set the following arguments in theme() (note that each argument also takes the ggplot2::element_text() function):
- Change the
plot.titletext withelement_text(size = 18, face = "bold")
- Change the
plot.subtitletextelement_text(size = 15)
- Change the
axis.text.yandaxis.text.xtext withelement_text(size = 14)
# click to execute code
gg_step16_line_02 <- ComicNewFemalePerc %>%
filter(sex == "Female Characters") %>%
ggplot(aes(x = year, y = sex_pct_per_yr_pub)) +
geom_line(aes(group = publisher, color = publisher),
size = 2) +
scale_y_continuous(limits = c(0.00, 0.50),
labels = scales::percent_format(accuracy = 1)) +
# adjust text on labels and axes
theme(legend.position = "none",
plot.title = element_text(size = 18,
face = "bold"),
plot.subtitle = element_text(size = 15),
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 14)) +
# color the axis
scale_color_manual(values = c("firebrick3",
"dodgerblue")) +
# add text annotations
annotate(geom = "text", x = 2001, y = .47,
label = "DC", size = 7,
color = "dodgerblue") +
annotate(geom = "text", x = 2002, y = .25,
label = "Marvel", size = 7,
color = "firebrick3") +
# add labels
labs_line_comics
# save
# ggsave(plot = gg_step16_line_02,
# filename = "gg-step16-line-02.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step16_line_02Open the gg-step16-line-02.png graph in the vscode IDE (above the Terminal console).
Now our plot looks very close to the original graph in the FiveThirtyEight article. If you like the looks of this graph, be sure to check out the ggthemes package for the theme_fivethirtyeight() and others.
step 17
- included in step17.md?
- grammar?
- spelling?
Multiple graph displays
Up to this point, the data we’ve presented resulted in a single visualization. Sometimes data are so complex and layered that they will require multiple graphs. In this case, we will want to arrange numerous figures together in a grid or facet. Charts presented this way are called ‘Small Multiples’, a term popularized by Edward Tufte.
“Small multiples resemble the frames of a movie: a series of graphics, showing the same combination of variables, indexed by changes in another variable.” Edward Tufte, The Visual Display of Quantitative Information.
We are going to demonstrate faceting with a dataset we’ve created from the internet movie database (IMDB).
IMDB data
IMDB makes multiple datasets available for download. We’ve combined the title.ratings.tsv, name.basics.tsv, and title.principals.tsv datasets into the ImdbData dataset with the following columns.
tconst - alphanumeric unique identifier of the title (used for joining)
nconst - alphanumeric unique identifier of the name/person (used for joining)
category - the category of job that person was in
primaryName – name by which the person is most often credited
birthYear – in YYYY format
averageRating – weighted average of all the individual user ratings
numVotes - number of votes the title has received
primaryTitle – the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle - original title, in the original language
isAdult - non-adult title or adult title
startYear – represents the release year of a title. In the case of TV Series, it is the series start year.
runtimeMinutes – primary runtime of the title, in minutes
genres – includes up to three genres associated with the title.
age_lead - age of actor/actress at time of film (age_lead = startYear - birthYear)
# click to execute code
ImdbData <- readr::read_csv("https://bit.ly/3qr6rg6")
glimpse(ImdbData)#> Rows: 136,925
#> Columns: 14
#> $ tconst <chr> "tt0000574", "tt0000630", "tt0000886", "tt0001101", "t…
#> $ nconst <chr> "nm0846887", "nm0624446", "nm0609814", "nm0923594", "n…
#> $ category <chr> "actress", "actress", "actor", "actor", "actor", "actr…
#> $ primaryName <chr> "Elizabeth Tait", "Fernanda Negri Pouget", "Jean Moune…
#> $ birthYear <dbl> 1879, 1889, 1841, 1870, 1869, 1844, 1891, 1879, 1867, …
#> $ averageRating <dbl> 6.1, 3.2, 5.0, 5.2, 4.7, 5.8, 5.5, 3.9, 4.6, 5.3, 4.2,…
#> $ numVotes <dbl> 609, 11, 23, 13, 10, 22, 47, 11, 13, 28, 14, 13, 25, 1…
#> $ primaryTitle <chr> "The Story of the Kelly Gang", "Hamlet", "Hamlet, Prin…
#> $ originalTitle <chr> "The Story of the Kelly Gang", "Amleto", "Hamlet", "Ab…
#> $ isAdult <chr> "non-adult title", "non-adult title", "non-adult title…
#> $ startYear <dbl> 1906, 1908, 1910, 1910, 1910, 1912, 1910, 1911, 1910, …
#> $ runtimeMinutes <dbl> 70, NA, NA, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA…
#> $ genres <chr> "Biography,Crime,Drama", "Drama", "Drama", "\\N", "Cri…
#> $ age_lead <dbl> 27, 19, 69, 40, 41, 68, 19, 32, 43, 28, 52, 39, 23, 41…
Below we build a skimr::skim() of the ImdbData dataset.
# click to execute code
SkimImdbData <- skimr::skim(ImdbData)
summary(SkimImdbData)| Name | ImdbData |
| Number of rows | 136925 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
First, we will review the character variables.
# click to execute code
SkimImdbData %>%
skimr::yank("character")Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| tconst | 0 | 1 | 9 | 10 | 0 | 136925 | 0 |
| nconst | 0 | 1 | 9 | 10 | 0 | 41765 | 0 |
| category | 0 | 1 | 5 | 7 | 0 | 2 | 0 |
| primaryName | 0 | 1 | 2 | 38 | 0 | 41658 | 0 |
| primaryTitle | 0 | 1 | 1 | 196 | 0 | 124228 | 0 |
| originalTitle | 0 | 1 | 1 | 196 | 0 | 128171 | 0 |
| isAdult | 0 | 1 | 11 | 15 | 0 | 2 | 0 |
| genres | 0 | 1 | 2 | 31 | 0 | 1056 | 0 |
These all look complete.
Do these numbers make sense?
The number of individual responses (n_unique) for each character variable is a good source for sanity checks. For example, the largest number of unique values belongs to the title id variable (tconst = 136925), and this is identical to the number of rows in the dataset. The next largest number belongs to the originalTitle variable (128171), and the documentation tells us this variable is the title for the film in its original language. By itself, this number doesn’t tell us much, but we can see the next largest number (124228) is the film’s primaryTitle, and it makes sense that the number of unique responses for these two variables is almost the same.
It also makes sense that the n_unique for actor/actress (nconst) is close to the actor/actress primaryName. There should be way more titles (originalTitle or primaryTitle) than genres, and there are (1056).
Finally, we can see the two binary variables we read about above (category and isAdult) only list 2 unique values (in n_unique), so it appears we imported these variables correctly.
Next, we will review the mean, standard deviation (sd), minimum (p0), median (p50), maximum (p100), and hist for the numeric variables in ImdbData.
# click to execute code
SkimImdbData %>%
skimr::focus(numeric.mean, numeric.sd,
numeric.p0, numeric.p50, numeric.p100,
numeric.hist) %>%
skimr::yank("numeric")Variable type: numeric
| skim_variable | mean | sd | p0 | p50 | p100 | hist |
|---|---|---|---|---|---|---|
| birthYear | 1946.85 | 27.79 | 1839 | 1950.0 | 2015 | ▁▂▆▇▂ |
| averageRating | 6.00 | 1.16 | 1 | 6.1 | 10 | ▁▂▇▆▁ |
| numVotes | 6015.29 | 43514.99 | 10 | 125.0 | 2334927 | ▇▁▁▁▁ |
| startYear | 1985.28 | 26.46 | 1906 | 1990.0 | 2021 | ▁▂▃▅▇ |
| runtimeMinutes | 97.14 | 24.13 | 2 | 94.0 | 1500 | ▇▁▁▁▁ |
| age_lead | 38.43 | 12.95 | 1 | 36.0 | 98 | ▁▇▅▁▁ |
Do these numbers make sense?
The average
birthYearis1947, which is plausible considering the date range for movies in the IMDB (1906-2021).The average movie rating is a
6.00, which also makes sense considering it’s a 1-10 scale.The number of votes (
numVotes) is the most skewed variable because it ranges from10to2334927.The
startYearfor the movie has an average of1985, and increases steadily from1906to2021, making sense because more films are being made every year.The average length of each movie in
ImdbDatais97.1minutes (runtimeMinutes). But we can also see from thehistthat the range forruntimeMinutesincludes some very low and high values (p0=2andp100=1500).
The actor/actress’s average age is 38.4, with a low of 1 and a high of 98 (both plausible).
Creating categories from a numerical variable
We will proceed under the assumption that our stakeholders asked us to help explain the relationship between the average rating a movie received (averageRating) and the number of votes that went into the score (numVotes).
There are quite a few years in this dataset, so instead, we will split each measure into decades. To do this, we need a categorical variable from the startYear variable. The cut() function is handy because we can supply the number of breaks we want to split the numeric startYear variable into (12 in this case). We will also create some clear labels for this categorical variable with the labels argument and make sure the format is ordered.
We check our new factor variable with the fct_count() from the forcats package.
# click to execute code
ImdbData <- ImdbData %>%
mutate(year_cat10 = cut(x = startYear,
breaks = 12,
labels = c("1910s", "1920s", "1930s",
"1940s", "1950s", "1960s",
"1970s", "1980s", "1990s",
"2000s", "2010s", "2020s"),
ordered = TRUE))
# check the count of our factor levels
fct_count(f = ImdbData$year_cat10, sort = TRUE)#> # A tibble: 12 x 2
#> f n
#> <fct> <int>
#> 1 2020s 25898
#> 2 2010s 24430
#> 3 1990s 15196
#> 4 2000s 14914
#> 5 1970s 13253
#> 6 1980s 12466
#> 7 1960s 10204
#> 8 1940s 7714
#> 9 1950s 6809
#> 10 1930s 4102
#> 11 1920s 1615
#> 12 1910s 324
step 18
- included in step18.md?
- grammar?
- spelling?
Number of votes
We want to examine how the numVotes variable changed over time (year_cat10). We want each decade on the x axis and the numVotes for each film on the y. Let’s review the numVotes variable below with skimr::skim()
ImdbData$numVotes %>%
skimr::skim()| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 6015.29 | 43514.99 | 10 | 33 | 125 | 666 | 2334927 | ▇▁▁▁▁ |
We can see the values for this variable are concentrated, or skewed, towards 0. However, we know enough information to build our labels.
labs_numvotes_yearcat10 <- labs(
title = "Number of IMDB review votes by decade",
subtitle = "Internet Movie Database (IMDB)",
caption = "https://www.imdb.com/",
x = "Decade",
y = "Number of votes")Scatterplots with geom_jitter
We’re going to view the distribution of numVotes across each decade in year_cat10 with ggplot2::geom_jitter(). geom_jitter() geom is similar to geom_point(), except it adds “a small amount of random variation to the location of each point.” We will set the following arguments inside the geom_jitter() to help demonstrate how it works.
size = 0.9= this controls how small/large the point will be
width = 0.25= determines how wide we want the points to ‘jitter’ (the default value is.40, so we’re decreasing it slightly)
alpha = 1/6= the alpha controls the saturation (or transparency) of the points
show.legend = FALSE= remove the legend (it’s labeled across thexaxis)
# # click to execute code
gg_step18_jitter_01 <- ImdbData %>%
ggplot(aes(x = year_cat10,
y = numVotes,
fill = year_cat10)) +
geom_jitter(size = 0.9,
width = 0.25,
alpha = 1/6,
show.legend = FALSE) +
guides(fill = FALSE) +
# add labels
labs_numvotes_yearcat10
# # save
# ggsave(plot = gg_step18_jitter_01,
# filename = "gg-step18-jitter-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step18_jitter_01Open the gg-step18-jitter-01.png graph in the vscode IDE (above the Terminal console). Notice how there are a handful of extreme points above 15000000?
Data with extreme values like this can be removed or transformed. If we have a reason to exclude these data, we can do filter the data to only the range we want in the graph (say Number of votes < 1500000), and then include this in the graph.
Run the code below to see what this would look like:
# click to execute code
labs_jitter_numvote_yearcat10_02 <- labs(
title = "Number of IMDB review votes by decade",
subtitle = "Internet Movie Database (IMDB)",
caption = "*Number of votes < 1500000; https://www.imdb.com",
x = "Decade",
y = "Number of votes")
gg_step18_jitter_02 <- ImdbData %>%
filter(numVotes < 1500000) %>%
ggplot(aes(x = year_cat10,
y = numVotes)) +
geom_jitter(size = 0.9,
width = 0.25,
alpha = 1/6,
show.legend = FALSE) +
labs_jitter_numvote_yearcat10_02
# save
# ggsave(plot = gg_step18_jitter_02,
# filename = "gg-step18-jitter-02.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step18_jitter_02Open the gg-step18-jitter-02.png graph in the vscode IDE (above the Terminal console). Now the extreme points above 15000000 have been removed.
Another option is to transform the axis, which we can do with ggplot2::scale_y_log10(). This changes the y axis scale using a base-10 log transformation.
We will also use the handy label_log10 function developed by Claus Wilke
# click to execute code
# load label_log10 function
source("https://bit.ly/35Ywt2q")
# create new labels
labs_jitter_numvote_yearcat10_03 <- labs(
title = "Number of IMDB review votes by decade",
subtitle = "Internet Movie Database (IMDB)",
caption = "*Number of votes log10 transformed; https://www.imdb.com",
x = "Decade",
y = "log10(Number of votes)")
# create new graph
gg_step18_jitter_03 <- ImdbData %>%
ggplot(aes(x = year_cat10,
y = numVotes)) +
geom_jitter(size = 0.9,
width = 0.25,
alpha = 1/6,
show.legend = FALSE) +
scale_y_log10(
labels = label_log10) +
labs_jitter_numvote_yearcat10_03
# save
# ggsave(plot = gg_step18_jitter_03,
# filename = "gg-step18-jitter-03.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step18_jitter_03Open the gg-step18-jitter-03.png graph in the vscode IDE (above the Terminal console). We can see the log10 transformation spreads the points out more uniformly across the y axis, but is also makes it more difficult to interpret the number of votes.
Regardless of choice we make (removal of extreme values or transforming the scale), we want to make sure we communicate either of these changes to our stakeholders’ and include them in any reports or write-ups so we don’t give them a distorted view of the underlying variable.
Read more about transforming a scale in R for Data Science.
step 19
- included in step17.md?
- grammar?
- spelling?
Average individual user ratings
The averageRating gives us the weighted average of all the individual user ratings. We will refresh our memory about this variable by taking a quick look at the skimr::skim() output below:
ImdbData$averageRating %>%
skimr::skim()| Name | Piped data |
| Number of rows | 136925 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 6 | 1.16 | 1 | 5.3 | 6.1 | 6.8 | 10 | ▁▂▇▆▁ |
The distribution for averageRating looks evenly distributed around the mean and median, and the hist looks fairly symmetrical. We will see if this holds true for averageRating when we look at it across the levels of year_cat10.
We now know enough about what we’re expecting to see in the graph to build the labels for the average individual user ratings (averageRating) vs. time (year_cat10).
labs_avgratng_yearcat10 <- labs(
title = "Average IMDB individual user ratings by decade",
subtitle = "Internet Movie Database (IMDB)",
caption = "https://www.imdb.com/",
x = "Decade",
y = "Average individual user ratings")Violin plots with geom_violin
We’re going to view the distribution of averageRating using the ggplot2::geom_violin() function. This function takes the following arguments:
- map
year_cat10tofillinside the initialggplot(aes())functions
- set
show.legend = FALSEinside thegeom_violin()function
# # click to execute code
gg_step19_violin_01 <- ImdbData %>%
ggplot(aes(x = year_cat10,
y = averageRating,
fill = year_cat10)) +
geom_violin(show.legend = FALSE) +
# add labels
labs_avgratng_yearcat10
# # save
# ggsave(plot = gg_step19_violin_01,
# filename = "gg-step19-violin-01.png",
# device = "png",
# width = 9,
# height = 6,
# units = "in")
gg_step19_violin_01Open the gg-step19-violin-01.png graph in the vscode IDE (above the Terminal console). We can see the violin plot shows us the averageRating variable as a smoothed area, with tails extending out to the ends of the distribution (towards the ‘neck’ of the violin).
In the next section, we are going to examine how the numVotes and averageRating relate to each other over time.
step 20
- included in step17.md?
- grammar?
- spelling?
Small multiples
We’ve been exploring…
Votes vs. user rating
# # click to execute code
gg_step20_scatter_01 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
geoup = year_cat10)) +
geom_point(size = 0.9,
alpha = 1/6,
show.legend = FALSE)
gg_step20_scatter_01log10(Votes) vs. user rating
gg_step20_scatter_02 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
geoup = year_cat10)) +
geom_point(size = 0.9,
alpha = 1/6,
show.legend = FALSE) +
scale_x_log10(labels = label_log10)
gg_step20_scatter_02Facetting for multiple comparisons
If you wish to facilitate direct comparisons you ideally need to juxtapose individual frames within the same view. The common technique used to achieve this in visualization is through small multiples, where you repeat the same representation for each moment in time of interest and present them collectively in an adjacent view, often through a grid layout.
scale_x_continuous()
gg_step20_facet_01 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
geoup = year_cat10)) +
geom_point(size = 0.5,
alpha = 1/10,
show.legend = FALSE) +
scale_x_continuous(limits = c(10, 2334927),
breaks = c(10, 1167464, 2334927),
labels = c('10', '~1.2M', '~2.3M')) +
facet_wrap(~ year_cat10)
gg_step20_facet_01This enables far more incisive comparisons.
facet_wrap
gg_step20_facet_02 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
group = year_cat10)) +
geom_point(size = 0.5,
alpha = 1/10,
show.legend = FALSE) +
scale_x_log10(labels = label_log10) +
facet_wrap(~ year_cat10)
gg_step20_facet_02geom_smooth()
gg_step20_facet_03 <- ImdbData %>%
ggplot(aes(x = numVotes,
y = averageRating,
geoup = year_cat10)) +
geom_point(size = 0.5,
alpha = 1/10,
show.legend = FALSE) +
geom_smooth(se = FALSE,
method = 'lm',
size = 0.75,
color = "firebrick2",
fullrange = TRUE) +
scale_x_log10(labels = label_log10) +
facet_wrap(~ year_cat10)
gg_step20_facet_03#> `geom_smooth()` using formula 'y ~ x'
finish
- included in finish.md?
- grammar?
- spelling?
Communication (relationships)
A graph describing a relationship answers a certain kind of question. These questions can be i.e., “What is the relationship between two quantitative measures?”
When presenting a graph with relationships, consider the context and framing for the conclusions your audience will draw. Is this good news? For example, if the chart displays a drop in sales over time, anticipate how this will change your presentation’s tone, and be ready to answer questions.
It’s also important not to confuse your audience when designing graphs. Relationships with ‘good news’ should have the data points showing a positive trend (i.e., as X values increase, so do Y values), and vice-versa. You don’t want to find yourself in a situation where you’re explaining that your graph doesn’t show what you’re audience is seeing.
Other resources for missing data
Read more about visualizing missing data here and on the visdat package site, or on the inspectdf package website.